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Abstract. The reconstruction of the history of evolutionary genome- 
wide events among a set of related organisms is of great biological inter- 
est. A simplified model that captures only content modifying operations 
was introduced recently. It allows the small phylogeny problem to be for- 
mulated as an alignment problem. In this work we present a branch-and- 
cut algorithm for this so-called duplication-loss alignment problem. Our 
method clearly outperforms the existing ILP based method by several 
orders of magnitude. We define classes of valid inequalities and provide 
algorithms to separate them efficiently and prove the A r P-hardness of 
the duplication-loss alignment problem. 



1 Introduction 

In the course of evolution genome-wide changes either (i) rearrange the order 
of the genes or (ii) modify the content. The former class of changes result from 
inversions, transpositions, and translocations, the latter have an effect on the 
number of gene copies that are either inserted, lost, or duplicated. The recon- 
struction of the history of such events among a set of (related) organisms is of 
great biological interest, since it can help to reveal the genomic basis of pheno- 
types. In a recent work [1] the authors study the problem of inferring an ancestral 
genome, from which two given genomes have evolved by content-modifying op- 
erations of type (ii) only, namely through the duplication and loss of genes. A 
prominent example of a gene family that is continuously duplicated and lost is 
transfer RNA (tRNA) [21314) . Since tRNA is an essential element in the trans- 
lation of RNA into proteins, reconstructing their evolutionary history among 
species might lead to new insights into the translationary machinery. 

The consequences of the evolutionary model that only accounts for the du- 
plication and loss of genes is twofold. First of all, the order of genes is preserved 
and thus the problem can be cast into an alignment problem pTJ, which is in 
general favorable from a combinatorial perspective. Secondly, duplications and 
losses are asymmetric operations and thus an ancestral genome can immediately 
be obtained from a duplication/loss scenario. 
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Related work and our contribution. Holloway et al. [T] proposed an ap- 
proach for the comparison of genomes under the duplication and loss model that 
is based on an integer linear programming (ILP) formulation of the problem. 
While the method in pQ iteratively adds cycle constraints to the ILP, we have 
developed this idea further into a cutting plane algorithm. Exploiting insights 
into the combinatorial structure of the problem we introduce cuts that can be 
separated efficiently and which lead to a branch-and-cut algorithm that out- 
performs the previous method [1] by several orders of magnitude. The related 
problem of labeling a given alignment of two genomes by duplications and losses 
was recently shown to be APX-h&rd [5] . We show that the problem of finding a 
maximum parsimony ancestral genome of two given genomes is iVP-hard. 

2 Problem definition 

We start with some basic definitions that are adopted from [T] Given two Genomes 
G a and G b and a set of allowed evolutionary operations O a sequence of n op- 
erations € O that transforms G a into G b is called a evolutionary history 
OQa^Qb. Let c(Oi) define the cost of the i-th operation in G a^ G b, then the 
cost of OQa^ G b is defined as X)"=i c (^i)- ^ there exists some evolutionary history 
G a^ G b then G b is called a potential ancestor of G b . The cost C(G a — > G b ) 
to transform sequence G a into G b is then the minimal cost of all possible evolu- 
tionary histories transforming G a into G b . 

These definitions allow us to define the central problem in this work, the two 
species small phylogeny problem. 

Definition 1. Two species small phylogeny problem 

— Input: 

• Two Genomes G a and G b . 

• Set of allowed evolutionary operations O 

— Output: 

• Potential common ancestor G* minimizing the cost C(G* — > G a ) + 
C(G* -> G b ) 

In general the set of allowed evolutionary operations can include Rever- 
sals and Transpositions which change the genome organization, as well as 
Losses, Insertions and Duplications that modify the genome content. The 
model proposed by Holloway et al. only allows for the two operations Loss and 
Duplication defined as follows: 

— A Duplication of size HI on genome G = G\ ■ ■ • G n copies a substring 
G[i, . . . , i + k] (origin) to some location j of G outside the interval [i, i + k] 
(target). 

— A Loss of size k + 1 removes a substring G[i, . . . , i + k] from G. 

As both operations do not shuffle gene order Holloway et al. suggest to pose 
the two-species small phylogeny problem for the duplication and loss model as 
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an alignment problem. Since the alignment of two extant genomes can only cover 
visible evolutionary operations Holloway et al. define a so called visible history 
and visible ancestor. A visible history is an evolutionary history is defined as 
a triplet {A, Oa^x, Oa^y) with Oa^x and Oa^y being evolutionary histories 
with the following property: For every duplication operation in Oa^x (resp. 
Oa^y) with origin S and target T it holds that this duplication is not followed 
by any other operation that will change the content in S or T. The genome A 
is then called a visible ancestor of X and Y. 

In order to solve the two species small phylogeny problem as an alignment 
problem Holloway emphet al. first define the labeling of an alignment as follows: 

Definition 2. Labeling (of an alignment): 

Given an alignment of two genomes G a and G b . The interpretation of this align- 
ment as a sequence of losses and duplications is called a labeling. The cost of 
the labeling is the summed cost of all duplication and loss operations. 

Holloway et al. show that there exist a one-to-one correspondence between a 
labeled alignments of two genomes X and Y and visible ancestors of X and Y. 
Therefore in order to solve the two species small phylogeny problem for loss and 
duplication, in the first step they solve the so called Duplication-Loss Alignment 
Problem which they defined as: 

Definition 3. Duplication- Loss Alignment Problem: 

Given two Genomes G a and G b , compute a labeled alignment of G a and G b 
with minimum cost. 

Once this problem is solved the computation of the common ancestor that solves 
the small phylogeny problem is straight forward pQ. 

2.1 Formal Problem Description 

When given two genomes G a of length n and G b of length m, we first construct 
the alignment graph T — (V, E), with V = V a U V . This graph is a complete bi- 
partite graph containing two sets of nodes V a = v", . . . ,v% and V b — v b , . . . ,v^ n . 
Some node vf corresponds to the i-th gene (G a [i] ) in genome G a and some node 
Vj corresponds to the j-th gene (G & [j]) in genome G b . The set E is the set of 
undirected edges, one for each pair of vertices {vf, v b }. Thus an undirected edge 
{vf, v b } corresponds to the alignment of gene at position i in G a and the gene at 
position j in G b . The cost associated to the alignment of these two genes is cy. 
If some alignment of G a and G b aligns gene G a [i] to G b [j], the corresponding 
edge {vf,Vj} is said to be realized by the alignment. Depending on the selected 
scoring scheme not every pair of genes from G° and G b are allowed to be aligned, 
which would correspond to assigning a cost of oo. For the problem of genome 
alignment usually every gene of G° can only be aligned the same gene in G b 
(with cost zero), such that only a very small subset of all alignment edges has a 
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cost < oo. Thus we work on a sparse version of the alignment graph, where E 
contains only those undirected edges {d°, v b } where c,j < oo. 

In a valid alignment of course only a subset of all alignment edges can be 
realized. Two alignment edges e\ — {vf,v b } and e 2 = are incompatible 

if (i < k and I < j) or (i > k and I > j). We denote that by {ei, G I, with 
1 being the set of all pairs of incompatible alignment edges. 

Additionally we construct the sets of all possible duplications and losses for 
genome G a (D a and L a ) and genome G b (D b and L b ). For a duplication d e D a 
with origin G a [i, . . . , i + k] and target G a [j, . . . , j + k] we define the functions 
origin(d) — [i, . . . ,i + k] and target(d) = [j, ...,j + k]. Similarly for some loss 
I G L a that removes substring G a [i, . . . , i + k] from G a we define the function 
span{l) = [i, . . . ,i + k]. According to the chosen scoring scheme every duplication 
operation d G D a U D b is charged some cost a- The same holds for every loss 
operation I G L a U L b where we charge some cost c; . 

Definition 4. Duplication- Cycle: 

A set of k duplication events D' C D a (D b resp.) forms a duplication cycle iff 
there exists permutation d\ , d<i , . . . , dk of the elements of D' such that 

(a) origin(di) D target(di-\) ^ V2 < i < k 

(b) origin{d\) fl target(dk) ^ 

Proposition 1. A valid labeling of an alignment does not contain a duplication 
cycle. 

Intuitively given a set of duplications d\ , . . . , dk such that for consecutive entries 
di,di+i the regions target{di) and origin{di + \) overlap. A reasonable biologi- 
cal interpretation induces a strict chronological partial order to the duplication 
events. Thus for every duplication event rf,, 1 < i < k it must hold that di hap- 
pened after di~\ and before di+\ which we denote by di-\ < di < This 
implies that no duplication cycle exists as it would contradict the strict partial 
order property of the duplication events and thus has no reasonable biological 
interpretation. 

3 ILP formulation and valid inequalities 

For the remainder of this paper we will consider only losses of size 1 and simplify 
the notation such that for every node v G V a the loss event l v G L a denotes the 
loss of the gene from G° corresponding to node v. The same holds for nodes in 
V b . 

3.1 An initial ILP model 

The formulation contains a binary variable: 

— Xij for every alignment edge e G E, e — {vf,v b }. 

— z v for every possible loss event l v G L a U L b ,. 
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— yd for every possible duplication d £ D a U D b . 

In a valid solution to the duplication loss alignment problem every gene in 
Genome G a and G must either be aligned to some gene in the other genome, 
labeled as a loss or labeled as the target of some duplication. Additionally for 
both genomes there must not exist any duplication cycle. For readability reasons 
we define the set D* as the set of all duplication cycles in D a and D b , that is 
D* = {D 1 C D a L)D b : D' forms a duplication cycle}. When we set all alignment 
edge costs to zero, the following ILP formulation equals the one by Holloway et 
al. and solves the duplication-loss-alignment problem. 



min CijXij + E c u z u + ^ c v z v + ^ c d y d + ^ c d y d (1) 

{vf,v*j}£E ueV" v£V b d£D a d £D b 

w.r.t. 



Xij + X k l < 1 



{v a ,v b }£E d£D a 
z J i£:target(d) 



\/{{vt,v b j}, {vk,vt}} e X 
VI < i < n 



VI < j < m 



j£target(d) 



d£D' 

x,y,z G {0, 1} 



VD' e D* 



(2) 
(3) 

(4) 

(5) 
(6) 



A solution to the ILP JI])-© corresponds to a solution to the duplication- 
loss alignment problem. But solving the ILP formulation above is not feasible for 
realistic values of m and n as the number of possible duplication cycles may grow 
exponentially with the length of the genomes. Therefore instead of enumerating 
all inequalities of class ([5]) beforehand, our approach and the one of Holloway et 
al. is to first relax the ILP and drop the duplication cycle constraints ([S]). 

Holloway et al. solve the problem by iteratively solving the ILP formulation 
(without constraint ([5])) and searching for violated duplication cycle inequalities 
which are then added to the ILP which is re-solved. These two steps are repeated 
until the solution of the ILP does not induce any duplication cycles. 

In contrast, our cutting plane approach explained in Section 2] does not it- 
eratively solve the ILP formulation. Instead we search for violated duplication 
cycle inequalities at every node of the branch-and-cut tree of the ILP solver 
and add them as cutting planes. Additionally we also identified other classes of 
valid inequalities that lead to a stronger LP relaxation of the ILP ([I])-© and 
thus can significantly speed up the solving process. In Section [3. 2 1 we define the 
classes of valid inequalities and in Section 14.11 we show how to efficiently solve 
the separation problem for each class. That is, given a (fractional) solution to 
the LP relaxation, identify a violated valid inequality. 
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3.2 Valid Inequalities 

In the following we let V be the convex hull of the feasible solutions to the above 
ILP. To define valid inequalities for V : we call pairs of alignments and/or du- 
plications incompatible if and only if a feasible solution cannot assign a value of 
1 to both of their corresponding variables. Incompatibility follows directly from 
constraints ([2|) or ([3]) and (j4|) in the ILP formulation. Then the incompatibility 
graph H has node set EUD and an edge between all pairs of incompatible align- 
ments and duplications. Similar to the multiple sequence alignment approach in 
[6] we introduce maximal clique inequalities. 

Maximal clique inequalities Sets K = Ke U Ke of pairwise incompatible 
alignments and duplications, with Ke Q E and Ke Q D a U D b , correspond 
precisely to the cliques of the incompatibility graph H . If there is no alignment 
nor duplication that is incompatible with all alignments and duplications in K 
the corresponding clique is maximal. The following maximal clique inequality is 
valid for V: 

£ Xe+ £ y d <l. (7) 

eeK E deK D 

Similarly to [6] we capture maximal sets of pairwise incompatible alignments by 
the following notation. We let 

£(k o l e ,m b <-> m e ) 

denote the collection of all sets S C E such that 

(a) all edges in S are pairwise incompatible 

(b) for each edge {vf, v 1 ^} £ S, h < I < h and m& < m < m e 

(c) S is maximal with respect to properties (a) and (b). 

Furthermore, we let 

D a (l^m) :={(v^v a q ):p<l,q>m}. 

D b is defined analogously. To show that maximal cliques in H can be character- 
ized by the following proposition, the same arguments as in [6] apply. 

Proposition 2. A clique K = Ke U Kd in H with sets Ke Q E and Ke C 
jja y jjb j s maximal if and only if either 

Ke = ®,K d =D x (£ + 1^£) 

for some 1 <£ < \G X \, x £ {a, b}, or 

K A e £(l b Hl e ,l« M), K D = D x (l b o l e ) 

for some 1 < lb < l e < \G X \, x £ {a, b}. 

In the next section we will show how this characterization can be exploited by 
an algorithm separating ([7|)- 
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Duplication island inequalities In the following we consider duplications in 
G a and define D := D a and V := V a . For duplications in G b the same holds. 
Consider the graph T' obtained by augmenting the alignment graph T with 
directed arcs A as follows. For every duplication d G D we add an arc from the 
node representing the ith element in origin(d) to the node representing the zth 
clement in targeted), for alii = 1 . . . \origin(d)\. Furthermore, for every (u, v) G A 
we let T>((u, v)) be the set of duplications d in D such that there exists an i with 
u representing the zth element in origin(d) and v representing the zth element 
in target(d). Then for any set S C V, T>(V \ S, S) denotes the set of duplications 
inducing arcs in the cut-set of (V \S,S), that is 

V(V\S,S)= |J V((u,v)), 

(u,v)eA: 

utv\s,ves 

Theorem 1. For every set S C V the following inequality is valid for V: 

m 

ves ves k=i d£V(v\s,s) 

Proof. Assume to the contrary that the sum on the left hand side of inequality |[H} 
is zero. Let graph T" be obtained from graph T' by removing all alignment edges 
whose corresponding x-variable is and all arcs (u, v) G A with yd = Q for all d G 
V((u,v)). Since every position in the genome must be covered (constraint (J3j> ) 
and since X^eS Zv + Sues Sfc=i x {v,v b } = 0, to every node v G S exactly 
one incoming arc in A must be incident. As 5Zdex>(v\s s) = ^ these arcs must 
originate at a node in S. Thus, if we repeatedly traverse, starting at an arbitrary 
node in S, the unique incoming arc backwards, we will never leave node set 
S and hence ultimately close a cycle. Due to constraint ([5]) the corresponding 
solution is infeasible. □ 

Lifted duplication cycle inequalities Again, we consider duplications in G a 
and define D := D a and V := V a . For duplications in G b the same holds. In this 
section we introduce the lifted duplication cycle inequalities, a class of constraints 
that dominate |5]). The high-level idea this class of constraints is based on is 
similar to the one underlying the lifted mixed cycle inequalities introduced in [B] . 
Consider a set of duplications CCD, which is partitioned into sets C 1 , . . . ,C f . 
If C satisfies 

(CI) for r = 1, . . . ,t, all edges in C r are pairwise incompatible 
(C2) every set {d\, . . . , d t }, where d r is chosen arbitrarily from C r for r = 1, . . . , t, 
forms a cycle according to Definition 2] 



then the inequality 



X] Vd < * ~ 1 

dec 



(9) 
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is valid for V . Inequalities (J5]) are a special case of ([9]) in which every set C r has 
cardinality one. If additionally 

(C3) C is maximal with respect to properties (CI) and (C2) 

we call ([9]) a lifted duplication cycle inequality. 

Proposition 3. An inequality of the form (j9|) with C = U*=iC 4 > C C D a , 
is a lifted duplication cycle inequality if and only if there exists a sequence of 
non-empty intervals [a\, b\], [02, 62], ■ • ■ [ot, 6 f ] such that for i = 1, . . . ,t it holds 

(PI) fldec* tar 9et(d) = [a i+1 ,b i+1 ] 

(P2) Vd £ C l : origin(d) n [a z ,b t ] ^ 

(P3) WeD\C : target(d) n [a l+1 ,b l+1 ] ^ 

origin(d) n [a it b t ] = V 3d' £ C i+1 : targeted) n origin(d') = 

where [a t +i,6 t +i] :— [01,61] and C* +1 := C . 

Intuitively, property (PI) captures condition (Cl), property (P2) ensures that 
(C2) is satisfied, and (P3) implies maximality. A formal proof is given below. 
Notice that condition (P3) is not equivalent to requiring C to be maximal with 
respect to (PI) and (P2), since a duplication satisfying (P3) might intersect 
interval [04+1,6^+1] only partially. 

Proof. To prove sufficiency assume set C has the claimed structure (P1)-(P3). 
For i = 1, . . . , t any two duplications in C l contain at least one common vertex 
in their target violating constraint (j3|) and are thus incompatible. Furthermore, 
any set of duplications {di t , . . . , di t } with d^ £ C l , i = 1, . . . ,t, forms a cycle 
according to Definition [H since due to properties (PI) and (P2) origin{di+\) D 
target(di) =/= 0. Finally, assume C is not maximal with respect to (Cl) and 
(C2). Consider a duplication d £ C such that CU{d} satisfies (Cl) and (C2). In 
particular, there exists 1 < r < t such that d is incompatible with all duplications 
in C r and thus target(d) n [a,+i,fo;+i] 7^ 0. From condition (P3) it follows that 
either origin(d) n [04,64] = 0, in which case d does not lie on a common cycle 
with any duplication from C !_1 , or 3d' £ C %+1 : target(d) D origin(d') = 0, which 
implies that there exists a duplication d' £ C n+1 such that d and d! do not lie 
on a common cycle, violating in both cases condition (C2). 

To prove necessity, we show that every set C that satisfies (Cl), (C2), and 
(C3), exhibits the claimed structure (P1)-(P3). For i = 1, . . . , t, let [04+1, 6i+i] := 
P| deCi target(d), where [04+1,64+1] := [01,61]. Due to condition (Cl), [04,64] 7^ 
and thus property (PI) is satisfied. By the definition of cycles (C2) implies that 
the origin of every duplication in C l intersects the target of every duplication 
in C 1 " 1 , i = l,...,t, where C° := C*. For intervals target(d'), d' £ C'" 1 , with 
non-empty intersection this is equivalent to Hd'ec*- 1 target{d') d origin{d) 7^ 0, 
for all d £ C % , which satisfies (P2). Finally, assume (P3) does not hold, i.e. 
there exists a duplication d £ D \ C with (i) target(d) D [04+1,64+1] 7^ 0, (ii) 
origin(d) n [04,64] 7^ 0, and (iii) Vd' £ C i+1 : target(d) n origin(d') 7^ 0. Due 
to constraint (|3]), (i) causes d to be incompatible with all duplications in C l . 
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Properties (i) and (ii) imply, by the definition of cycles (see Definition |4| , that 
for every cycle C that contains an arbitrary duplication dl 6 C™, replacing d! by 
d results in a cycle C' . Therefore C U {d} satisfies (CI) and (C2), which is in 
contradiction to (C3). □ 

4 A Branch-and-Cut Approach 

In this section we show that the three classes of valid inequalities introduced in 
the previous section can be separated efficiently At the end of the section we 
discuss further details of our implementation. 

4.1 Separation algorithms 

Theorem 2. For a given solution to the ILP for two genomes G a and G b the 
maximum weight maximal clique of alignment edges 

c* = arg max N Xij 

within the interval [Z&,Z e ] in G a and [mb,m e ] in G b can be computed in time 
0(l e bfn e b) with l e b = l e — lb + f and m e b — m e — nib + I 

Proof. In order to detect such a maximal clique we use the pair graph data 
structure which was introduced by Reinert et al. [7J Given the subgraph T" of 
the alignment graph induced by the vertex subsets vf,..., vf and v^ rib , ■ ■ ■ , v me 
the corresponding pairgraph PG(T") is a Z e & x m e b directed grid graph where arcs 
go from bottom to top and right to left. A node in in row p and column q 
of the pairgraph corresponds to the edge connecting node and v b nb+q _ 1 

in T.In the case of sparse alignment graph which is not a complete bipartite 
graph not every node in the pairgraph corresponds to an alignment edge in E. 
Those that do correspond to some alignment edge are called essential nodes. 
For every source to sink path p = ni <meb , . . . , ni ebl i in the pairgraph, the edges 
of the alignment graph corresponding to essential nodes in p form a maximal 
clique clique of conflicting alignment edges. In order to find the set c* we simply 
weight every essential node in the pairgraph by the value for the corresponding 
alignment edge variable in the actual solution and compute a longest node- 
weighted source to sink path. Since the pairgraph is directed and acyclic and the 
number of vertices and arcs is 0(l e bm e b), the longest source to sink path can be 
computed in 0(l e bfn e b) time. 

Theorem 3. For a given point (x*,y*,z*) eR l + l+m+lVl , it can be determined 
in time 0(n 3 ) whether a maximal clique inequality ([7} is violated. 

Proof. We show how to separate maximal clique inequalities that involve du- 
plications in D := D a ; For cliques containing duplications in D b a symmetric 
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argument applies. As suggested by the structure of maximal cliques (see Propo- 
sition [2]), and as described in [6], we compute for all 1 < h < l e < n (a) Ke £ 
£ (If, f>l e ,lf> to) that maximizes J2eeK E x l an d (b) J2deD(i b <^i e ) Vd- The corre- 
sponding maximal clique inequality is violated if J2eeK E x e + Y^deD(i b +±i e ) Vd > 
1. 

Concerning step (a) , we compute for each of the n — 1 possible values of h the 
longest path tree from nx >m in the pairgraph PG(T'), where T" is the subgraph 
of the alignment graph induced by the corresponding sets of vertices vf , . . . , 
and v\, . . . Computing the longest path tree takes time 0(nm), and thus 
the total time required to execute step (a) is (D(n 2 m). 

Step (b) can be performed for all pairs of genes i, j in time 0(n 2 ) by the 
following dynamic program. First we define a^j :— J2deD(i^j) Vd an d '■= 
YX=j J2dev(i,k) Vd and observe that a^ 3 = a^ij+TTij. First, for allp = 1, . . . ,n, 
we compute TT p<q , q = p, . . . , n, in the order n p , n = Y,dev( P ,n)Vd an d ^ P ,q = 
, g ) V*d in time 0(n 2 ). Then we compute in the order p-2,...,n, 
Cp.q = °~p-i,q + ^p,q-, Q = Pi ■ ■ ■ > n i which takes 0(n 2 ) time. □ 

Next we will show that a slightly relaxed version of constraint © can be 
separated efficiently. For that we define the multiplicity a(d, S) of a duplication 
d in the cutset of a cut (V \S,S): 

a(d, S) := \{(u, v) £ A : u £ V \ S,v £ S A d £ V{u, v)}\. (10) 



Theorem 4. For x £ {a, b} let D := D x , V := V x , n := |V| and m = 
|V y |, where y is the complement of x in {a, b}. For a given point (x*,y*,z*) £ 
jjl^l+l-DI+IVI^ .^ mn ^ g determined in time 0(n 3 5 \/\D\) whether the following 
relaxation of a duplication island constraint ||5J) is violated. 

m 

E< + EE*W + E *(d,S)-y* d >l (11) 

ves veSk=i dev(v\s,s) 

Proof. For an arbitrary node s £ V we let graph G S (F, A, w) contain a node Vi 
for every gene G x [i) in genome G x . Arc set A = A\ U A2, where Ai contains for 
every pair of vertices (u, v) £ V x V with T)(u, v) ^ $ A\ an arc (u, v) of weight 
u) := J2dev((u v)) y*d- ^2 contains for every v £ V with v ^ s an arc (s, u) 
of weight w(s,v) := z* + ££L X x* Then for every S C 7 with s £ V \ S 
the sum on the left hand side of inequality (fTTj) equals the weight of the cut 
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^2 w(u,v)+ ^2 w(s,v) (12) 

(u,v)EAf. (s,v)eA 2 :v€S 
u£V\S,v£S 

E E ^+EK+X>WJ ^ 

(u,»)eAi: deV((u,v)) vGS \ fc=l / 

uev\s, ties 

m 

dev(v\s,s) ves «eSfe=i 

The last step follows directly from the definition of a(d, S), see (fTU|) . Determining 
set S 1 * that minimizes the left hand side of inequality (|11[) is thus equivalent to 
computing the minimum s — t cut in G s , over all s 6 This can be reduced 
to 2\V\ — 2 maximum flow problems, i.e. from an arbitrary node s to all t ^ s 
and from all t ^ s to s, each taking time 0(|F| 2 y^j4]) using Goldberg- Tarjan's 
preflow push-relabel algorithm. □ 

We next show how to separate a certain relaxation of the lifted duplication 
cycle constraints efficiently. The high-level idea of the algorithm is to construct a 
graph, whose nodes represent elements that satisfy (PI) and whose edges connect 
intervals that satisfy (P2). Similarly to the separation of lifted mixed cycles in 
the multiple sequence alignment problem [6], a potentially violated constraint 
in the relaxed form of a lifted duplication cycles is then obtained by a shortest 
path computation. 

Theorem 5. For x € {a,b} let D := D x , V := V x , n := \V\ and m = 

\V y \, where y is the complement of x in {a, b}. For a given point (x*,y*,z*) S 
Kj E ' + '' D ' + '^ , it can be determined in time 0(n 3 + \D\n 2 ) whether the relaxation 
of a lifted duplication cycles ([9]), in which for every interval [ai, bi] in Proposition 
di — bi, for all i = 1, . . . ,t, is violated. 

Proof. We construct an arc weighted graph G — (V, A, w) as follows. Similar 
to the alignment graph we have one node for every gene in the given genome. 
For every pair of nodes ut and Vj we compute the set of duplications T>(i,j) 
whose origin contain Vi and whose target contain Vj, i.e. T>(i,j) :— {d G D : i G 
origin(d) A j G targeted)}. For every non-empty set X>(?,j) we add an arc from 
node Vi to node Vj . We define the weight of an arc as 

dev(i,j) 

The the violation of a lifted duplication cycle having the claimed structure, given 
by the sequence of nodes Vi t , Vi 2 , . . . , Vi t , with Vi j G [aj ,bj], is 



{V\S, S) in G s : 

E w ( a ) = 

(u,v)eA 1 UA 2 : 
u£V\S,v£S 



E i/2-*+i = i-E(i- E ^) = i-E^(K>^ +1 ))> 

j=ldeCJ j=l deCJ 3 = 1 
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where Vi t+1 := v^. Note that sets T>(i,j) satisfy (P1)-(P3) and thus the last 
inequality follows. The most violated lifted duplication cycle of the relaxed kind 
can therefore be obtained by computing the shortest arc-weighted path in G 
from every node v to itself (if it exists). 

Implemented naively, the weight of the arcs in A can be determined in 
0(\D\n 2 ). Note that due to constraint [3] the arc weights are all non- negative 
and we can compute the shortest paths by Dijkstra's algorithm. Since graph G 
has 0(n 2 ) arcs and Dijkstra's algorithm is called n times, the shortest cycle in 
G can be found in time 0(n 3 ). 

5 NP-hardness 

The reduction is from the decision version of MAX-2SAT, which is defined as 
follows. Given a boolean formula cf> in conjunctive normal form with variables 
Xi,...,x n , and clauses Ci,...,C m , where each clause Ci is a disjunction of 
exactly 2 literals, and a positive integer fc. Decide whether there exists a truth 
assignment that satisfies at least fc clauses. It is well known that the decision 
version of MAX-2SAT is NP-complete. 

We construct gadgets for each variable and each clause. We start with a 
description of the variable gadgets. 

Variable Gadget: For a variable Xi we let vri{ denote the number of clauses 
that the variable appears in. The gadget for variable Xi consists of two strings 
s\ , s\ of length Ami each. If xi appears in clauses Ci 1 , . . . , Ci m . we set 



Lemma 1. The optimal cost of an alignment of the two strings s\, s\, form- 
ing the variable gadget of a variable Xi is Ami ■ There exist exactly two optimal 
alignments that do not use duplications. 

Proof. Neither of the two strings s\, s|, contains at least 2 consecutive charac- 
ters that appear at least twice in the string. Therefore there always exists an 
optimal solution that does not use any duplication. Such a solution is obtained 
by maximizing the number of matchings in the alignment. If any character Xi or 
c\ in s\ is matched to any occurrence of that character in s^, none of the char- 
acters Xi or c\. can be matched, and vice versa. Thus an alignment that matches 
all characters Xi and all c\ , respectively all characters Xi and all c\ . , by aligning 
the fcth character of s\ [1 . . . 2nn] to the fcth character of s\ [2m, + 1 . . . 4m<] , 
respectively the fcth character of s\ [2rrii + 1 . . . 4m, ] to the fcth character of 
s\[l . . . 2mi], 1 < fc < 2m,i, is optimal. Furthermore, an alignment that matches 
all characters c\ . or all characters c£ . , 1 < j < mi, only allows a unique matching 
of characters Xi, respectively Xj. □ 




X^C^ ' ' ' XiCi^ XiCi^ • • • XiCi 
X^C^ ' ' ' XiCi^ XiCi^ • • • XiCi 



Duplication-Loss Alignment by Cutting Planes 13 



An alignment that matches all characters X\ and all c\. of a variable gadget is 
said to be in FALSE configuration, and an alignment that matches all characters 
Xi and all c\. of a variable gadget is said to be in TRUE configuration. Next we 
show that the variable gadgets can be independently set to a TRUE or FALSE 
configuration. 

Lemma 2. The cost of an optimal alignment of strings 

Y = s 2 S 2 ' ' ' s 2 

is 8m, where each variable gadget is in TRUE or FALSE configuration. 

Proof. By Lemma [TJ an alignment that set each variable gadget arbitrarily to 
a TRUE or FALSE configuration has an overall cost of Y17=i = & m - Fur- 
thermore, an optimal alignment of substrings X and Y is obtained by optimally 
aligning the substrings of the each variable gadget independently since the sets 
of characters that appear in different variable gadgets are disjoint. By Lemma [T] 
the claim follows. 

Clause Gadget: The gadget for a clause Ci — £^ V £i 2 is composed of two 
strings t\, t\, of length 4 each. If 1^ is a variable Xj, we set t\[l . . .2] = xjc?, if 
£i 1 is the negation of a variable Xj, i.e. Xj, we set t\[l . . . 2] = XjC?. We define 
t\ [3 ... 4] as a function of literal 4 2 analogously. We set t\ [1 . . . 2] = t\ [3 . . . 4] and 
4 [3 . . .4] = t\[l . . . 2]. As an example consider a clause Ci of the form Xj V Xk- 
Then 

Next we show a one-to-one correspondence between the optimal cost of a duplication- 
loss alignment instance that is composed of the variable gadgets and a single 
clause gadget, and the evaluation of the clause under the implied truth assign- 
ment. 

Lemma 3. Consider the two strings 

5 

X = S-^ S^ ' ' ' S-^ $ * * * $ t-y 

Y^ — S<2 S<2 ' ' ' S<2 $ * ' * $ t<2 
5 

obtained by concatenating all variable gadgets and the clause gadgets for a clause 
Ci, separated by string $$$$$. The cost of an optimal alignment of X andY that 
sets all variable gadgets in TRUE or FALSE state is 8m if Ci is satisfied under 
the truth assignment implied by the variable gadgets and 8m -I- 2 otherwise. 
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Proof. Without loss of generality we assume that xj occurs positive, and Xk 
occurs negative in C\, i.e. C\ — Xj V x~k- The other 3 cases can be covered 
analogously . Consider strings X and Y. No two characters of t\ and t\ can 
be matched simultaneously to a character in s\ s\ ■ ■ ■ s£ , respectively s\ s\ ■ ■ ■ s™, 
since the alignment edges would cross. Therefore, at most 4 characters in the 
clause gadget can be matched to characters in the variable gadgets. In this case 
none of the characters $ can be matched. Replacing the at most 4 matchings by 
duplications and losses will increases the cost by at most 8, while matching all 
characters $ decreases the cost by 10. Thus an optimal alignment will not match 
any character in the clause gadget to a character in one of the variable gadgets. 

If d is not satisfied, i.e. the variable gadget for Xj is in FALSE configuration 
and the variable gadget for Xk is in TRUE configuration, no two consecutive 
characters of t\ (t\) can be duplicated, since the characters in any of their oc- 
currences are matched in the variable gadgets. On the other hand, matching 
characters Xjc[ or XkC^ in the clause gadget and covering the remaining charac- 
ters by duplications from the corresponding substrings in the variable gadgets 
causes an additional cost of 2. Since at most 2 characters can be matched in 
a clause gadget and no 3 or more consecutive characters in t\ {t\) occur in 
s i s i ' ' ' s i ( s 2 s 2 ' ' ' s 2 )j this is optimal. 

If both literals of Cj evaluate to TRUE, i.e. the variable gadget for Xj is in 
TRUE configuration and the variable gadget for Xk is in FALSE configuration, 
both XjC? and Xkcf are unmatched in both s\ ■ ■ ■ s" and s\s\ - ■ ■ . Characters 
XjC? in a variable gadget can be a duplication of the corresponding characters in 
the clause gadget and vice versa, but one duplication invalidates the reverse. The 
same holds for Xfccf . Furthermore, reverse duplications contribute equally to the 
total cost of the solution. Thus if we cover all characters in the clause gadgets by 
duplications we incur an additional cost of 4. However, if we arbitrarily choose 
to match xjc? or x^cf in the clause gadget, the same characters in the variable 
gadgets can be the product of a duplication and the total cost reduces by 4 to 
8m + 4 -4. 

If one literal of Ci evaluates to TRUE and the other to FALSE, the argument 
is the same as in the previous case, except that instead of choosing the charac- 
ters to match in the clause gadget arbitrarily, we match the characters whose 
corresponding literal evaluates to TRUE. □ 

Finally we construct an instance to the duplication-loss alignment problem by 
concatenating all variable and clause gadgets, separated in the following way: 
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Lemma 4. Consider the two strings 

5 

,1 „2 



V 

5 2 s 2 ' ' ' 3 



Y = 4 " 2 




The cost of an alignment of X and Y that sets all variables gadgets in TRUE 
or FALSE state is 10m — 2k, where k is the number of clauses satisfied under 
the implied truth assignment. 

Proof. The proof is by induction on the number of clause gadgets q. We claim 
that the optimal cost of an alignment of X and Y restricted to the left-most q 
clause gadgets has cost 8m + 2(q — k), where k is the number of clause gadgets 
among the q left-most clause gadgets whose corresponding clause is satisfied 
under the implied truth assignment. We also show that in an optimal solution no 
character of any variable gadget is matched to a character of any clause gadget. 
The base case (q = 1) holds by Lemma[3]and the construction in the proof of the 
same lemma. To show the induction step, assume that the claim holds for q = £. 
To show that the claim holds for q = I + 1 , we observe that no two characters of 
t[ +1 and t e 2 +1 can be matched simultaneously to a character in Y , respectively 
X , since the alignment edges would cross. Therefore, at most 4 characters in 
the clause gadget for variable X£ + i can be matched to characters in strings X e 
and Y e . In this case none of the characters $^+i can be matched, since the only 
occurrence of characters $£+i is directly preceding t^ +1 and t 2 +1 . Replacing the 
at most 4 matchings by duplications and losses will increase the cost by at most 
8, while matching all characters $t+i decreases the cost by 10. Thus an optimal 
alignment will not match any character in the clause gadget to a character in 
X e , respectively Y e . As no two consecutive characters in t{ +1 or t 2 +1 appear 
in any other clause gadget, there always exists an optimal solution that does 
not contain any duplication between the gadget for clause CV+i and any other 
clause gadget . Furthermore , substrings t[ + 1 [ 1 . . . 2] , t { + 1 [3 . . . 4] , and t e 2 + 1 [ 1 . . . 2] , 
^2 +1 [3 ... 4] appear exactly once in X e , respectively Y l , and do not intersect any 
occurrence of a sequence of at least two characters appearing multiple times 
in X^, respectively Y e . Therefore, and due to the structural assumption of the 
induction hypothesis, in an optimal alignment of X e and Y , the unmatched 
characters in the unique occurrences of these substrings in X 1 and Y l will be 
covered by losses. Therefore, by the same arguments as in the proof of Lemma|3l 
an optimal alignment of X e+1 and Y l+1 incurs no additional cost compared 
to an optimal alignment of X and Y l if clause Ce+i is satisfied under the 
implied truth assignment, and an additional cost of 2 otherwise, summing to 
an overall cost of 8m + 2(£ — k) = 8m + 2((£ + 1) — (k + 1)), respectively 
8m + 2(£-k) + 2 = 8m + 2((£ + l)-k). □ 

Theorem 6. The duplication-loss alignment problem is NP-hard. 

Proof. The proof follows from Lemma |4] and the fact that the decision version 
of Max- 2 SAT is A^F-hard. 
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6 Experimental results 

In this section we present the preliminary results of the comparison between the 
branch-and-cut algorithm as outlined in Sectiorj4] and the iterative ILP formu- 
lation suggested by Holloway et al. in terms of run time. We implemented both 
approaches in C++ and used the CPLEX solver version 12.4 as ILP solver. All 
experiments were run single threaded on a 2.67 GHz Intel Xeon cpu. 

For the implementation of the graphs we used the lemon graph library [8] 
and the seqan library [9] that provide the standard graph algorithms (max-flow- 
min-cut, dag shortest path, Dijkstra). To detect all duplication cycles induced 
by an intermediate solution in the iterative ILP approach we first construct a 
digraph with a node for every duplication event di £ D a (resp D ). We insert 
a directed edge (t>i,Vj) if origin(dj) intersects target(di). Then we weight every 
edge (vi, Vj) with the value (1 — j/j) where yi is the duplication variable for di. In 
this directed graph we then enumerate all cycles with weight strictly less than 
1 (same argument as for the lifted cycle separation). The cycles are enumerated 
using a slight variant of the DFS-based algorithm by Johnson [TO]. For every 
detected cycle the corresponding violated duplication cycle constraint is added 
to the ILP before it is solved again. 

For the branch-and-cut approach we utilized the user cut interface shipped 
with the ILOG CPLEx/Concert library. For both algorithms we used the default 
solver settings and measured the run time to compute an optimal solution. 

We used the same scoring scheme as Holloway et al. where alignments of 
homologous genes have cost 0, while every single gene loss and every duplication 
event is charged a cost of 1. Obviously there may exist multiple optimal solutions 
for some instances therefore both algorithms not necessarily report the same 
solution. 

For the benchmark we used two types of data, real world data and simulated 
data. 



Real-world instances We compared the two approaches on two sets of real- 
world instances that were also used in pQ. The sets contain the stable tRNA 
and rRNA contents of 12 Bacillus and 6 Vibrionaceae lineages that were pre- 
processed like discussed in [1] . So they are linearized according to their origin of 
replication and inverted segments are manually re-inverted. For both sets we ran 
both algorithms for all pairs of genomes leading to 66 pairs for Bacillus and 15 
pairs for Vibrionaceae. The average run time of the direct iterative ILP algorithm 
on the Bacillus instances was around 19 seconds, while the our branch-and-cut 
algorithm took less than 1.5 seconds. 

For the Vibrionaceae pairs, the advantage of our algorithm is even more 
prominent as the ILP did not finish after several days on some instances, while 
the branch-and-cut algorithm always needed less than one hour - on most in- 
stances only a few minutes. A more detailed benchmark on this dataset is still 
in process. 
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Simulated instances The simulation of input instances follows the strategy of 
Holloway et al. The simulation is performed in the following steps. First a random 
sequence R of length n and alphabet size a is simulated where the alphabet 
symbols at each position are iid. In the second step I moves (single gene loss 
or duplication event) are applied to R where the length of a duplication follows 
a Gaussian distribution with mean 5 and standard deviation 2 and the start 
position of every move is uniformly distributed. This sequence is then used as 
the ancestor genome X and two extant genomes are generated by again applying 
I moves to X for each of them. In Table [T] we present run time results for several 
settings of parameters n, I, and a where we simulated 50 instances for each 
setting. 



Table 1. Benchmark on simulated data 



Setting 


avg. run time in sec. 


(n,l,a) 


branch-and-cut 


iterative ILP 


(100,10,50) 


0.3 


8.9 


(200,20,100) 


1.5 


149.4 


(400,40,20) 


7.0 


2499.8 



(see text for details) 



7 Conclusion 

The preliminary results from the run time comparison show that the branch- 
and-cut algorithm clearly outperforms the ILP. In particular for larger instances 
(bigger n) and pairs of rather distant genomes like in the Vibrionaceae dataset, 
the improvement in terms of run time is immense. Therefore the branch-and- 
cut algorithm allows to solve more difficult instances than the pairs of Bacillus 
genomes on a desktop pc and does not require compute clusters to solve the 
instances in a reasonable amount of time. 
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